Using LLMs To Summarize Personal Research

Our goal is to have an LLM help us generate interview questions for someone. I find that I’m constantly trying to get up to speed on a person’s background and story when preparing to meet them.

There are a ton of awesome resources about a person online that we can use:

  • Twitter Profiles

  • Websites

  • Other Interviews (YouTube or Text)

Let’s bring all these together by first pulling the information and then generating questions or bullet points we can use as preparation.

First let’s import our packages! We’ll be using LangChain to help us interact with OpenAI

# Unzip data folder

import zipfile
with zipfile.ZipFile('../../data.zip', 'r') as zip_ref:
    zip_ref.extractall('..')

# LLMs
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate

# Twitter
import tweepy

# Scraping
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md

# YouTube
from langchain.document_loaders import YoutubeLoader
# !pip install youtube-transcript-api

# Environment Variables
import os
from dotenv import load_dotenv

load_dotenv()
True

You’ll need a few API keys to complete the script below. It’s modular, so if you don’t want to pull from Twitter, feel free to leave those blank.

TWITTER_API_KEY = os.getenv('TWITTER_API_KEY', 'YourAPIKeyIfNotSet')
TWITTER_API_SECRET = os.getenv('TWITTER_API_SECRET', 'YourAPIKeyIfNotSet')
TWITTER_ACCESS_TOKEN = os.getenv('TWITTER_ACCESS_TOKEN', 'YourAPIKeyIfNotSet')
TWITTER_ACCESS_TOKEN_SECRET = os.getenv('TWITTER_ACCESS_TOKEN_SECRET', 'YourAPIKeyIfNotSet')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY', 'YourAPIKeyIfNotSet')
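
Since the script is modular, it can help to check up front which keys actually got loaded before calling a service. Here is a minimal sketch, assuming the same 'YourAPIKeyIfNotSet' placeholder used above. `find_missing_keys` is a hypothetical helper written for this example, not part of any library.

```python
import os

PLACEHOLDER = "YourAPIKeyIfNotSet"  # same fallback string used above

def find_missing_keys(names, env=None):
    """Return the names whose value is unset, empty, or still the placeholder."""
    env = os.environ if env is None else env
    return [name for name in names if env.get(name, PLACEHOLDER) in ("", PLACEHOLDER)]

twitter_keys = [
    "TWITTER_API_KEY",
    "TWITTER_API_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
]

# Fake environment (made-up values) so the check can be tried without real keys
fake_env = {"TWITTER_API_KEY": "abc", "OPENAI_API_KEY": "sk-test"}
print(find_missing_keys(twitter_keys, env=fake_env))
```

If the list comes back non-empty for the Twitter keys, you could simply skip the Twitter step and continue with the other sources.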

For this tutorial, let’s pretend we are going to be interviewing Elad Gil since he has a bunch of content online

Pulling Data From Twitter#

Great, now let’s set up a function that will pull tweets for us. This will help us get current events that the user is talking about. I’m excluding replies since they usually don’t have a ton of high signal text from the user. This is the same code that was used in the Twitter AI Bot tutorial.

def get_original_tweets(screen_name, tweets_to_pull=80, tweets_to_return=80):
    
    # Tweepy set up
    auth = tweepy.OAuthHandler(TWITTER_API_KEY, TWITTER_API_SECRET)
    auth.set_access_token(TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_TOKEN_SECRET)
    api = tweepy.API(auth)

    # Holder for the tweets you'll find
    tweets = []
    
    # Go and pull the tweets
    tweepy_results = tweepy.Cursor(api.user_timeline,
                                   screen_name=screen_name,
                                   tweet_mode='extended',
                                   exclude_replies=True).items(tweets_to_pull)
    
    # Run through tweets and remove retweets and quote tweets so we can only look at a user's raw emotions
    for status in tweepy_results:
        if hasattr(status, 'retweeted_status') or hasattr(status, 'quoted_status'):
            # Skip if it's a retweet or quote tweet
            continue
        else:
            tweets.append({'full_text': status.full_text, 'likes': status.favorite_count})

    
    # Sort the tweets by number of likes. This will help us short_list the top ones later
    sorted_tweets = sorted(tweets, key=lambda x: x['likes'], reverse=True)

    # Get the text and drop the like count from the dictionary
    full_text = [x['full_text'] for x in sorted_tweets][:tweets_to_return]
    
    # Convert the list of tweets into a string of tweets we can use in the prompt later
    users_tweets = "\n\n".join(full_text)
            
    return users_tweets
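
If you want to see the filter, sort, and join steps without hitting the Twitter API, here is a rough sketch on plain dictionaries. The sample tweets are made up, and the `is_retweet` / `is_quote` flags are stand-ins for the `hasattr` checks on Tweepy's status objects above.

```python
def pick_top_tweets(tweets, tweets_to_return=80):
    """tweets: dicts with 'full_text', 'likes', and optional 'is_retweet' / 'is_quote' flags."""
    # Keep only original tweets (no retweets, no quote tweets)
    originals = [t for t in tweets if not t.get("is_retweet") and not t.get("is_quote")]
    # Most-liked first, then keep the top N
    ranked = sorted(originals, key=lambda t: t["likes"], reverse=True)
    # One string, with a blank line between tweets, ready to drop into a prompt
    return "\n\n".join(t["full_text"] for t in ranked[:tweets_to_return])

# Made-up sample data for illustration only
sample_tweets = [
    {"full_text": "Bootstrapping is underrated", "likes": 120},
    {"full_text": "RT: someone else's take", "likes": 900, "is_retweet": True},
    {"full_text": "Defensibility comes later", "likes": 300},
]
print(pick_top_tweets(sample_tweets, tweets_to_return=2))
```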

Ok cool, let’s try it out!

user_tweets = get_original_tweets("eladgil")
print (user_tweets[:300])
More AI companies with sudden virality + paying customers should just bootstrap

0. Running co for cash may be best success

1. If it does scale, being profitable or near to it creates lot of options

2. it may not scale, or only work for a few months

3. Why get on the… https://t.co/Q9TRQo4yau

Som

Awesome, now that we have a few tweets, let’s move on to pulling data from a web page or two.

Pulling Data From Websites#

Let’s do two pages

  1. His personal website which has his background - https://eladgil.com/

  2. One of my favorite blog posts from him around AI defensibility & moats - https://blog.eladgil.com/p/defensibility-and-competition

First let’s create a function that will scrape a website for us.

We’ll do this by pulling the raw HTML, putting it in a BeautifulSoup object, then converting that object to Markdown for better parsing.

def pull_from_website(url):
    
    # Doing a try in case it doesn't work
    try:
        response = requests.get(url)
    except requests.exceptions.RequestException as e:
        # In case it doesn't work, report the error and return an empty string
        # so callers can still safely concatenate the result
        print (f"Whoops, error: {e}")
        return ""
    
    # Put your response in a beautiful soup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Get your text
    text = soup.get_text()

    # Convert your html to markdown. This reduces tokens and noise
    text = md(text)
     
    return text
# I'm going to store my website data in a simple string.
# There is likely optimization to make this better but it's a solid 80% solution

website_data = ""
urls = ["https://eladgil.com/", "https://blog.eladgil.com/p/defensibility-and-competition"]

for url in urls:
    text = pull_from_website(url)
    
    website_data += text

Awesome, now that we have both of those data sources, let’s check out a sample

print (website_data[:400])
Elad Gil




Welcome to Elad Gil's retro homepage!

 Who? I am a technology entrepreneur. LinkedIn profile is here.
What?
I am an investor or advisor to companies including Airbnb, Airtable, Anduril, Brex, Checkr, Coinbase, dbt Labs, Deel, Figma, Flexport, Gitlab, Gusto, Instacart, Navan, Notion, Opendoor, PagerDuty, Pinterest, Retool, Rippling, Samsara, Square, Stripe
I am involved with AI com

Awesome, to round us off, let’s get the information from a YouTube video. YouTube has tons of data like podcasts and interviews. This will be valuable for us to have.

Pulling Data From YouTube#

We’ll use LangChain’s YouTube loader for this. It only works if the video already has a transcript; if it doesn’t, we’ll move on. You could get the transcript via Whisper if you really wanted to, but that’s out of scope for today.

We’ll make a function we can use to loop through videos

# Pulling data from YouTube in text form
def get_video_transcripts(url):
    loader = YoutubeLoader.from_youtube_url(url, add_video_info=True)
    documents = loader.load()
    transcript = ' '.join([doc.page_content for doc in documents])
    return transcript
# Using a regular string to store the youtube transcript data
# Video selection will be important.
# Parsing interviews is a whole other can of worms so I'm opting for one where Elad is mostly talking about himself
video_urls = ['https://www.youtube.com/watch?v=nglHX4B33_o']
videos_text = ""

for video_url in video_urls:
    video_text = get_video_transcripts(video_url)
    
    videos_text += video_text
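
The loop above will raise an error if a video has no transcript. One way to actually "move on" is to catch the failure and skip that video. Here is a minimal sketch under that assumption. `collect_transcripts` and `fake_fetch` are hypothetical names for this example, and in practice you would pass `get_video_transcripts` as the fetcher.

```python
def collect_transcripts(urls, fetch_transcript):
    """Join transcripts for every URL that loads; skip the ones that fail."""
    transcripts = []
    skipped = []
    for url in urls:
        try:
            transcripts.append(fetch_transcript(url))
        except Exception as error:  # broad on purpose: loader errors vary
            skipped.append(url)
            print(f"Skipping {url}: {error}")
    return " ".join(transcripts), skipped

# Stand-in fetcher so the sketch runs offline
def fake_fetch(url):
    if "no-transcript" in url:
        raise ValueError("no transcript available")
    return f"transcript for {url}"

text, skipped = collect_transcripts(["video-a", "no-transcript-b", "video-c"], fake_fetch)
print(text)
print(skipped)
```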

Let’s look at a sample from the video

print(videos_text[:300])
I like to say that startups are an act of desperation and the desperation went out of the ecosystem over the last two or three years and we just had people showing up for the status and the money and now I think it's getting back to people who are doing it for a variety of reasons including the impa

Awesome, now that we have all of our data, let’s combine it together into a single information block

user_information = user_tweets + website_data + videos_text

Our user_information variable is a big messy wall of text. Ideally we would clean this up more and try to increase the signal-to-noise ratio. However, for this project we’ll just focus on the core use case of gathering data.

Next we’ll chunk our wall of text into pieces so we can do a map_reduce process on it. If you want to learn more about techniques to split up your data, check out my video on OpenAI Token Workarounds
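
As one small example of what a cleanup could look like, the scraped text contains long runs of blank lines that cost tokens without adding information. A rough sketch using only the standard library follows. `squeeze_whitespace` is a hypothetical helper, and the sample string is adapted from the website output shown earlier.

```python
import re

def squeeze_whitespace(text):
    """Trim trailing spaces and collapse runs of blank lines into a single blank line."""
    text = re.sub(r"[ \t]+\n", "\n", text)   # drop trailing spaces/tabs on each line
    text = re.sub(r"\n{3,}", "\n\n", text)   # 3+ newlines in a row become one blank line
    return text.strip()

sample = "Elad Gil\n\n\n\n\nWelcome to Elad Gil's retro homepage!\n \nWho? I am a technology entrepreneur.  \n"
print(squeeze_whitespace(sample))
```

This only touches whitespace, so it is a safe first pass. Anything smarter (dropping navigation text, deduplicating content) depends on the source.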

# First we make our text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=20000, chunk_overlap=2000)
# Then we split our user information into different documents
docs = text_splitter.create_documents([user_information])
# Let's see how many documents we created
len(docs)
3
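
To get a feel for what chunk_size and chunk_overlap do, here is a simplified sketch using fixed-size character windows. This is not how RecursiveCharacterTextSplitter works internally (it prefers to break on separators such as paragraphs and newlines). It only illustrates the overlap idea, and `chunk_with_overlap` is a hypothetical function.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size character windows where each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk starts with the last character of the previous one, so context that straddles a boundary shows up in both pieces.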

Because we have a special request for the LLM on our data, I want to make custom prompts. This will allow me to tinker with what data the LLM pulls out. I’ll use LangChain’s load_summarize_chain with custom prompts to do this. We aren’t making a summary, but rather just using load_summarize_chain for its easy map_reduce functionality.

First let’s make our custom map prompt. This is where we’ll instruct the LLM to pull out interview questions and tell it what makes a good question.

map_prompt = """You are a helpful AI bot that aids a user in research.
Below is information about a person named {persons_name}.
Information will include tweets, interview transcripts, and blog posts about {persons_name}
Your goal is to generate interview questions that we can ask {persons_name}
Use specifics from the research when possible

% START OF INFORMATION ABOUT {persons_name}:
{text}
% END OF INFORMATION ABOUT {persons_name}:

Please respond with a list of a few interview questions based on the topics above

YOUR RESPONSE:"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text", "persons_name"])

Then we’ll make our custom combine prompt. This is the set of instructions that we’ll give the LLM on how to handle the lists of questions returned in the first step above.

combine_prompt = """
You are a helpful AI bot that aids a user in research.
You will be given a list of potential interview questions that we can ask {persons_name}.

Please consolidate the questions and return a list

% INTERVIEW QUESTIONS
{text}
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text", "persons_name"])
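
To see roughly what the LLM receives, you can fill a template by hand. The sketch below uses plain str.format as a stand-in for PromptTemplate, with a shortened copy of the map prompt. The chunk text is a placeholder, and inside the chain the real class does this substitution for you.

```python
# Shortened copy of the map prompt, for illustration only
map_prompt_sketch = """Below is information about a person named {persons_name}.

% START OF INFORMATION ABOUT {persons_name}:
{text}
% END OF INFORMATION ABOUT {persons_name}:
"""

# Each chunk of research is dropped into {text}; the name is repeated wherever it appears
filled = map_prompt_sketch.format(
    persons_name="Elad Gil",
    text="(one chunk of research goes here)",
)
print(filled)
```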

Let’s create our LLM and chain. I’m increasing the temperature a bit for more creative language. If you notice that your questions have hallucinations in them, turn temperature to 0

llm = ChatOpenAI(temperature=.25, model_name='gpt-4')

chain = load_summarize_chain(llm,
                             chain_type="map_reduce",
                             map_prompt=map_prompt_template,
                             combine_prompt=combine_prompt_template,
#                              verbose=True
                            )
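
Conceptually, the map_reduce chain runs the map prompt once per document and then feeds all of those responses into a single combine prompt. Here is a bare-bones sketch of that flow with a stand-in LLM function. `run_map_reduce` and `fake_llm` are hypothetical, and the real chain has more going on (for example, dealing with intermediate results that are too long).

```python
def run_map_reduce(chunks, persons_name, call_llm, map_template, combine_template):
    # Map step: ask for questions from each chunk independently
    partial_results = [
        call_llm(map_template.format(text=chunk, persons_name=persons_name))
        for chunk in chunks
    ]
    # Reduce step: hand all of the partial lists to one final prompt
    combined_input = "\n\n".join(partial_results)
    return call_llm(combine_template.format(text=combined_input, persons_name=persons_name))

# Stand-in "LLM" that records each prompt and returns a numbered response,
# so the flow can be followed offline
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return f"response {len(calls)}"

result = run_map_reduce(
    chunks=["chunk one", "chunk two", "chunk three"],
    persons_name="Elad Gil",
    call_llm=fake_llm,
    map_template="Questions for {persons_name} from: {text}",
    combine_template="Consolidate for {persons_name}: {text}",
)
print(len(calls), result)
```

With three chunks you get four LLM calls in total: three map calls and one combine call. That is worth keeping in mind for cost when using a larger model.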

Ok, finally! With all of our data gathered and prompts ready, let’s run our chain

output = chain({"input_documents": docs, # The three docs that were created before
                "persons_name": "Elad Gil"
               })
Warning: model not found. Using cl100k_base encoding.
print (output['output_text'])
1. As an investor and advisor to various AI companies, what are some common challenges you've observed in the industry, and how do you recommend overcoming them?

2. Can you elaborate on the advantages of bootstrapping for AI startups and share any success stories you've come across?

3. What are some key lessons you've learned from your experiences in high-profile companies like Twitter, Google, and Color Health that have shaped your approach to investing and advising startups?

4. How do you think AI will continue to shape the job market in the coming years?

5. What motivated you to enter the healthcare space as a co-founder of Color Health, and how do you envision the role of AI in improving healthcare outcomes?

6. Can you share some insights on what sets high growth companies apart from others and the key factors that contribute to their rapid growth?

7. How do you evaluate the defensibility of AI startups when considering investment or advisory opportunities?

8. What excites you the most about the future of AI, and what challenges do you foresee in its development and implementation?

9. Can you share your vision for Color Health and how it aims to revolutionize the healthcare industry?

10. What were the key challenges you faced during the rapid growth of Twitter, and how did you overcome them?

11. What advice would you give to founders looking to build defensibility into their startups from the beginning?

12. Can you share an example of a company that has successfully maintained a user-centric focus and how it has contributed to their success?

13. How do you see the balance between serving customer needs and building defensibility evolving in the future of AI-driven products and services?

14. Can you elaborate on the factors that contribute to your prediction of 2023 being a rough year for mid to late-stage private technology companies and how startups can prepare for these challenges?

15. What do you think are the most promising applications of large language models like GPT in the near future, and how can startups leverage them for growth?

16. How do you see the open versus closed structure playing out in the AI industry, and what implications could it have for startups and established companies in the AI space?

17. How do you think the costs involved in training large language models like GPT-3 and GPT-4 will affect competition and innovation in the AI industry, particularly for startups with limited resources?

18. What do you think are the key factors driving growth in the space and defense technology sector, and what opportunities do you see for startups in this industry?

19. How do you envision the future of defense tech startups, and what challenges do they need to overcome to succeed in this competitive landscape?

20. What lessons can other startups in the defense sector learn from Anduril's success, and how can they apply these strategies to their own businesses?

Awesome! Now we have some questions we can iterate on before we chat with the person. You can swap out different sources for different people.

These questions won’t be 100% ‘copy & paste’ ready, but they should serve as a really solid starting point for you to build on top of.

Next, let’s port this code over to a Streamlit app so we can share a deployed version easily